q21ubfandomcom-20200214-history
What is the technology behind Google and why people choose Google instead of Bing?
'A Short Answer' ---- A search engine operates in the following order http://en.wikipedia.org/wiki/Web_search_engine : *Web crawling http://info.webcrawler.com/mak/projects/robots/exclusion.htm *Indexing *Searching Web search engines http://www.searchenginewatch.com/ work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler — an automated Web browser which follows every link on the site. The contents of each page are then analyzed to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields). Data about web pages are stored in an index database for use in later queries. A query can be a single word. The index helps find information as quickly as possible. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista , store every word of every page they find. This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. 'These are some targets Jeffrey A. Dean, Benedict Gomes, Krishna Bharat, Georges Harik, Monika H. Henzinger: united states patent application publication: pulication number :U.S. 2002/0123988 of the search engines' 1-Ranking search results by re-ranking the results based on local inter-connectivity: A search engine improves the relevancy of the results by refining a standard relevancy score based on the interconnectivity of the initially returned set of documents. The search engine obtains an initial set of relevant documents by matching a user's search terms to an index. A re-ranking component in the search engine then refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set. 2-Detecting duplicate and near-duplicate files: Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document (ii)assigning the extracted parts to one or more of a predetermined number of lists, and (iii) Generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match. 3-Detecting query-specific duplicate documents: An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. 4-Techniques for finding related hyperlinked documents using link-based analysis: Backlink and forward link sets can be utilized to find web pages that are related to a selected web page. The scores for links from web pages that are from the same host and links from web pages with numerous links can be reduced to achieve a better list of related web pages. 5-Methods and apparatus for employing usage statistics in document retrieval: Methods and apparatus provide improved organization of documents responsive to a search query. In one embodiment, a search query is received and a list of responsive documents is identified. The responsive documents are organized based in whole or in part on usage statistics. GoogleThe Anatomy of a Large-Scale Hypertextual Web Search Engine: Sergey Brin and Lawrence Page http://google.stanford.edu/is designed to crawl and index the Web efficiently and produce much more satisfyi ng search results than existing systems. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. The name, Google, is a common spelling of googol, or 10100 and fits well with the goal of building very large-scale search engines. Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access. Google has an architecture that can support novel research activities on large-scale web data. It stores all of the actual documents it crawls in compressed form. 'System Features' 'PageRank' PageRank is an objective measure of citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. 'Anchor Text' The text of links is treated in a special way Google search engine. Most search engines associate the text of a link with the page that the link is on. Google associates it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. 'Other Features:' Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository. 'Google Architecture Overview' In this section, a high level overview of how the whole system works is depicted as pictured in Figure 1.Most of Google is implemented in C or C++ for efficiency. ''' *In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. *There is a URL server that sends lists of URLs to be fetched to the crawlers. *The web pages that are fetched are then sent to the store server. *The store server then compresses and stores the web pages into a repository. *Every web page has an associated ID number called a doc ID which is assigned whenever a new URL is parsed out of a web page. *The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. *Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. *The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. *The URL resolver reads the anchors file and converts relative URLs into absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links which are pairs of doc IDs. *The links database is used to compute Page Ranks for all the documents. *The sorter takes the barrels, which are sorted by doc ID and resorts them by word ID. *A program called Dump Lexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. *The searcher is run by a web server and uses the lexicon built by Dump Lexicon together with the inverted index and the Page Ranks to answer queries. '''A Long Answer ---- 'Crawling the Web' The most important part of a search engine is the crawler which is different in search engines. For Google, the major operations are Crawling, Indexing, and Sorting. Presently we study in what order a crawler should visit the URLs it has seen, in order to obtain more “important” pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. Several importance metrics, ordering schemes, and performance evaluation measures for this problem is defined. A crawler is a program that retrieves Web pages, commonly for use by a search engine or a Web cache. Roughly, a crawler starts off with the URL for an initial page p0. It retrieves p0, extracts any URLs in it, and adds them to a queue of URLs to be scanned. Then the crawler gets URLs from the queue (in some order), and repeats the process. Every page that is scanned is given to a client that saves the pages, creates an index for the pages, or summarizes or analyzes the content of the pages. Crawlers for the major search engines (e.g., AltaVista, InfoSeek, Excite, and Lycos) attempt to visit most text Web pages, in order to build content indexes. Other crawlers may also visit many pages, but may look only for certain types of information (e.g., email addresses). The design of a good crawler presents many challenges. Externally, the crawler must avoid overloading Web sites or network links. Internally, the crawler must deal with huge volumes of data. Unless it has unlimited computing resources and unlimited time, it must carefully decide what URLs to scan and in what order. The crawler must also decide how frequently to revisit pages it has already seen, in order to keep its client informed of changes on the Web. In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URL server serves lists of URLs to a number of crawlers (we typically ran about 3). Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. 'Importance metrics' Not all pages are necessarily of equal interest to a crawler’s client Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.. For instance, if the client is building a specialized database on a particular topic, then pages that refer to that topic are more important, and should be visited as early as possible. Similarly, a search engine may use the number of Web URLs that point to a page, the so-called backlink count, to rank user query results. If the crawler cannot visit all pages, then it is better to visit those with a high backlink count, since this will give the end-user higher ranking results. Given a Web page p, we can define the importance of the page, I(p), in one of the following ways. 1. Similarity to a Driving Query Q: A query Q drives the crawling process, and I(p) is defined to be the textual similarity between p'' and ''Q. We use IS(p) to refer to the importance metric in this case. We also use IS(p,Q) when we wish to make the query explicit. To compute similarities, we can view each document (p'' or ''Q) as an n-'' dimensional vector ''w1, . . . ,wn. The term wi in this vector represents the i'' th word in the vocabulary. If ''wi does not appear in the document, then wi is zero. If it does appear, wi is set to represent the significance of the word. One common way to compute the significance wi is to multiply the number of times the i''th word appears in the document by the inverse document frequency (''idf) of the i''th word. The ''idf factor is one divided by the number of times the word appears in the entire “collection” which in this case would be the entire Web. The idf factor corresponds to the content discriminating power of a word: A term that appears rarely in documents (e.g., “queue”) has a high idf, while a term that occurs in many documents (e.g., “the”) has a low idf. (The wi terms can also take into account where in a page the word appears. The similarity between p'' and ''Q can then be defined as the inner product of the p'' and ''Q vectors. Another option is to use the cosine similarity measure, which is the inner product of the normalized vectors. 2. Backlink Count: The value of I(p) is the number of links to p'' that appear over the entire Web. We use ''IB(p) to refer to this importance metric. Intuitively, a page p'' that is linked to many pages is more important than one that is seldom referenced. This type of “citation count” has been used extensively to evaluate the impact of published papers. Evaluating ''IB(p) requires counting backlinks over the entire Web. A crawler may estimate this value with , the number of links to p'' that have been seen so far. '''3. PageRank :' Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web The IB(p) metric treats all links equally. Thus, a link from the Yahoo home page counts the same as a link from some individual’s home page. However, since the Yahoo home page is more important (it has a much higher IB count), it would make sense to value that link more highly. The PageRank backlink metric, IR(p), recursively defines the importance of a page to be the weighted sum of the importance of the pages that have backlinks to p''. We use for the estimated value of ''IR(p) when we have only a subset of pages available. More formally, if a page has no outgoing link, we assume that it has outgoing links to every single Web page. Next, consider a page p'' that is pointed at by pages ''t1, . . . , tn. Let ci be the number of links going out of page ti. Also, let d be a damping factor, Then, the weighted back link count of page p'' is given by ''IR(p) = (1 − d) + d + ・ ・ ・ + IR(tn)/cn This leads to one equation per Web page, with an equal number of unknowns. The equations can be solved for the IR values. They can be solved iteratively, starting with all IR values equal to 1. At each step, the new IR(p) value is computed from the old IR(ti) values (using the equation above), until the values converge. This calculation corresponds to computing the principal eigenvector of the link matrices. 4. Forward Link Count: For completeness we may want to consider a metric IF(p) that counts the number of links that emanate from p. Under this metric, a page with many outgoing links is very valuable, since it may be a Web directory. This metric can be computed directly from p''. '''5. Location Metric:' The IL(p) importance of page p is a function of its location, not of its contents. If URL u'' leads to ''p, then IL(p) is a function of u. For example, URLs ending with “.com” may be deemed more useful than URLs with other endings, or URLs containing the string “home” may be more of interest than other URLs. As stated earlier, our importance metrics can be combined in various ways. For example, we may define a metric IC(p) = k1IS(p,Q)+k2 IB(p), for some constants k1, k2. This combines the similarity metric (under some given query Q'') and the backlink metric. Pages that have relevant content and many backlinks would be the highest ranked. The difference between crawlers and consequently difference between Search engines emanates from this metric which is different in search engines. 'Problem definition' Our goal is to design a crawler that if possible visits high ''I(p) pages before lower ranked ones, for some definition of I(p) by using the Crawl & Stop with Threshold criteria: At this point a perfect crawler would have visited pages r1, . . . , rK, where r1 is the page with the highest importance value, r2 is the next highest, and so on. We call pages r1 through rK the “'hot'” pages. We assume that the crawler visits K'' pages. We are given an importance target ''G, and any page with I(p) ≥ G is considered hot. Let us assume that the total number of hot pages is H''. The performance of the crawler, ''PST ©, is the fraction of the H'' hot pages that have been visited when the crawler stops. If ''K http://www.netconnexion.com/blog/differences-between-google-and-bing is a web search engine from Microsoft. Although Google and Bing both offer instant search features which display potential search queries as you're typing into the search box, Google's instant search tends to provide more relevant results, more quickly. While Bing's recent changes make it a more viable search option, it's still lacking in terms of result quality in response to informative queries. In many ways, Bing's results look like Google's used to. Everyone's search engine use comes down to personal preference. People mostly stick with Google until Bing is able to claim a much greater advantage over the web's largest, most preferred search engine. Bing has a long way to catch up with Google, if that ever happens. Some important differences between Google and Bing are: '''1) Different weight to signals:' Every search engine can give a different weight to the hundreds of signals found on other web pages as well as on site factors. So what Google deems important may not be viewed as quite as important by Bing. Even minute weightings in favor of one signal over another can have significant impact on the results. 2) Importance of links Google puts a greater emphasis on highly credible link sources than does Bing. 3) Canonical requirements Google is very good at figuring out a website’s canonical URL (the main/most important page to be used and indexed over other, duplicate pages) even if a website is not coded to properly return the canonical URL. Google’s Webmaster Tools even allows website owners to manage their canonical URL without changing any code. Also, Google supports the use of the canonical tag as a way for website owners to easily avoid duplicate content issues. Bing, on the other hand, does not support the canonical tag and does not offer canonical URL management in their Webmaster Center. There are more differences, but the above examples are some of the most important differences between the two search engines. ---- References ----